Credit Card Default Predictor (DSCI Group Project Proposal)¶

Introduction¶

This project is to come up with a classification predictive model of whether the an account will default on his/her next credit card payment. The dataset is downloaded from Kaggle.

From the dataset, we see the each observation constitude one client, with basic info such as sex, education, etc. A list of repayment status for each month (PAY_n), bill amount from the same months (BILL_AMTn), and amount of previous payments (PAY_AMTn). Where n corresponds to month before Sept 2005.

Preliminary Exploratory Data Analysis¶

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
library(ggplot2)
require(gridExtra)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample      1.2.0
✔ dials        1.2.0     ✔ tune         1.1.2
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.1     ✔ yardstick    1.2.0
✔ recipes      1.0.8     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

Loading required package: gridExtra


Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine


In [2]:
data_url <- url("https://raw.githubusercontent.com/mlool/dsci-100-2023W1-group-008-31/main/data/UCI_Credit_Card.csv")
credit_card_data <- read_csv(data_url)
credit_card_data
Rows: 30000 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (25): ID, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A spec_tbl_df: 30000 × 25
IDLIMIT_BALSEXEDUCATIONMARRIAGEAGEPAY_0PAY_2PAY_3PAY_4⋯BILL_AMT4BILL_AMT5BILL_AMT6PAY_AMT1PAY_AMT2PAY_AMT3PAY_AMT4PAY_AMT5PAY_AMT6default.payment.next.month
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1 2000022124 22-1-1⋯ 0 0 0 0 689 0 0 0 01
212000022226-12 0 0⋯ 3272 3455 3261 0100010001000 020001
3 9000022234 00 0 0⋯1433114948155491518150010001000100050000
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋱⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
2999830000122374 32-1⋯208782058219357 0 0220004200 200031001
2999980000131411-10 0⋯527741185548944859003409 117819265296418041
3000050000121460 00 0⋯365353242815313 20781800 14301000 100010001

Observations: We notice that PAY skiped PAY_1 and added a PAY_0, should to keep things consistent, I will rename the column. Furthermore, we observe some columns should be of type category (fct) based on the dataset website, so I will mutate those columns as well. We also filter out all category data that doesn't have a corresponding status.

In [3]:
category_colnames <- c("SEX", "EDUCATION", "MARRIAGE", "default.payment.next.month")
sex_categories <- c(1, 2)
education_categories <- c(1, 2, 3, 4, 5, 6)
marriage_status <- c(1, 2, 3)
pay_status <- c(-1, 1, 2, 3, 4, 5, 6, 7, 8, 9)
credit_card_tidy <- credit_card_data |>
                        rename(PAY_1 = PAY_0) |>
                        mutate(across(category_colnames, ~as_factor(.x))) |>
                        filter(SEX %in% sex_categories,
                               EDUCATION %in% education_categories,
                               MARRIAGE %in% marriage_status,
                               PAY_1 %in% pay_status,
                               PAY_2 %in% pay_status,
                               PAY_3 %in% pay_status,
                               PAY_4 %in% pay_status,
                               PAY_5 %in% pay_status,
                               PAY_6 %in% pay_status,)
credit_card_tidy
Warning message:
“There was 1 warning in `mutate()`.
ℹ In argument: `across(category_colnames, ~as_factor(.x))`.
Caused by warning:
! Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(category_colnames)

  # Now:
  data %>% select(all_of(category_colnames))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.”
A tibble: 4047 × 25
IDLIMIT_BALSEXEDUCATIONMARRIAGEAGEPAY_1PAY_2PAY_3PAY_4⋯BILL_AMT4BILL_AMT5BILL_AMT6PAY_AMT1PAY_AMT2PAY_AMT3PAY_AMT4PAY_AMT5PAY_AMT6default.payment.next.month
<dbl><dbl><fct><fct><fct><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><fct>
1226000021251-1-1-1-1⋯85172228713668218189966858322301 036400
2212000022139-1-1-1-1⋯ 0 632 316 316 316 0 632316 01
29 5000023147-1-1-1-1⋯204030430 257 34153421204430430257 00
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋱⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
29977 40000122472232⋯5125947151469344000 02000 03520 01
29992210000121343222⋯ 2500 2500 2500 0 0 0 0 0 01
29995 80000122342222⋯77519826078115870003500 07000 040001
In [4]:
credit_card_split <- initial_split(credit_card_tidy, prop = 0.75, strata = default.payment.next.month)
credit_card_training <- training(credit_card_split)
credit_card_testing <- testing(credit_card_split)
credit_card_training
A tibble: 3035 × 25
IDLIMIT_BALSEXEDUCATIONMARRIAGEAGEPAY_1PAY_2PAY_3PAY_4⋯BILL_AMT4BILL_AMT5BILL_AMT6PAY_AMT1PAY_AMT2PAY_AMT3PAY_AMT4PAY_AMT5PAY_AMT6default.payment.next.month
<dbl><dbl><fct><fct><fct><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><fct>
1226000021251-1-1-1-1⋯ 8517222871366821818 9966 858322301 036400
29 5000023147-1-1-1-1⋯ 204030430 257 3415 3421 204430430 257 00
3123000021227-1-1-1-1⋯1533914307369231727013281153391430737292 00
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋱⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
2994313000013145-1-1-1-1⋯ 2756 1585 12151826283127561585121524021
29977 4000012247 2 2 3 2⋯5125947151469344000 02000 03520 01
2999221000012134 3 2 2 2⋯ 2500 2500 2500 0 0 0 0 0 01

Data Summarization¶

In [5]:
# Number of observation in each class
training_count <- credit_card_training |>
                    group_by(default.payment.next.month) |>
                    summarize(count = n())
training_count
A tibble: 2 × 2
default.payment.next.monthcount
<fct><int>
01958
11077

I notice there's slightly more none defaults than defaults, leading me to think perhaps we should measure our model based on precision/recall instead of accuracy alone.

In [6]:
training_sex_count <- credit_card_training |>
                        group_by(SEX, default.payment.next.month) |>
                        summarize(count = n()) |>
                        group_by(SEX) |>
                        summarize(percent_of_default = sum(count[default.payment.next.month == 1]) / sum(count)) |>
                        mutate(SEX = case_when(SEX == 1 ~ "male", SEX == 2 ~ "female"))

training_sex_count
`summarise()` has grouped output by 'SEX'. You can override using the `.groups`
argument.
A tibble: 2 × 2
SEXpercent_of_default
<chr><dbl>
male 0.3900482
female0.3303522

I notice that male tends to have a slightly higher chance to default, but difference is small, so unlikely to use this column, as it would introduce gender discrimination in our model, which we want to avoid.

In [7]:
training_marriage_count <- credit_card_training |>
                            group_by(MARRIAGE, default.payment.next.month) |>
                            summarize(count = n()) |>
                            group_by(MARRIAGE) |>
                            summarize(percent_of_default = sum(count[default.payment.next.month == 1]) / sum(count)) |>
                            mutate(MARRIAGE = case_when(MARRIAGE == 1 ~ "married", 
                                                        MARRIAGE == 2 ~ "single",
                                                        MARRIAGE == 3 ~ "other"
                                                       ))

training_marriage_count
`summarise()` has grouped output by 'MARRIAGE'. You can override using the
`.groups` argument.
A tibble: 3 × 2
MARRIAGEpercent_of_default
<chr><dbl>
married0.3432642
single 0.3649386
other 0.4800000

I notice that single and other have a slightly higher chance to default, however the difference is small, so might not be an important factor

In [8]:
training_education_count <- credit_card_training |>
                            group_by(EDUCATION, default.payment.next.month) |>
                            summarize(count = n()) |>
                            group_by(EDUCATION) |>
                            summarize(percent_of_default = sum(count[default.payment.next.month == 1]) / sum(count)) |>
                            mutate(EDUCATION = case_when(EDUCATION == 1 ~ "grad", 
                                                        EDUCATION == 2 ~ "university",
                                                        EDUCATION == 3 ~ "high school",
                                                        EDUCATION == 4 ~ "others",
                                                        EDUCATION == 5 ~ "unknown",
                                                        EDUCATION == 6 ~ "unknown",
                                                       ))

print(training_education_count)
`summarise()` has grouped output by 'EDUCATION'. You can override using the
`.groups` argument.
# A tibble: 6 × 2
  EDUCATION   percent_of_default
  <chr>                    <dbl>
1 grad                     0.287
2 university               0.421
3 high school              0.371
4 others                   0    
5 unknown                  0    
6 unknown                  0    

I notice that grad has a lower chance to default than university and high school, although difference is small between university and highschool, grad is quite a big difference, so might be an important factor

Data Visualization¶

In [9]:
options(repr.plot.width = 12, repr.plot.height = 10)

pay_1_plot <- credit_card_training |> 
                ggplot(aes(x = PAY_1, fill = default.payment.next.month)) +
                geom_bar(stat = "count", position = "fill") +
                labs(x = "pay status (Sep 2005)", y = "percent default") + 
                theme(legend.position="bottom")

pay_2_plot <- credit_card_training |> 
                ggplot(aes(x = PAY_2, fill = default.payment.next.month)) +
                geom_bar(stat = "count", position = "fill") +
                labs(x = "pay status (Aug 2005)", y = "") + 
                theme(legend.position="bottom")

pay_3_plot <- credit_card_training |> 
                ggplot(aes(x = PAY_3, fill = default.payment.next.month)) +
                geom_bar(stat = "count", position = "fill") +
                labs(x = "pay status (Jul 2005)", y = "") + 
                theme(legend.position="bottom")

pay_4_plot <- credit_card_training |> 
                ggplot(aes(x = PAY_4, fill = default.payment.next.month)) +
                geom_bar(stat = "count", position = "fill") +
                labs(x = "pay status (Jun 2005)", y = "percent default") + 
                theme(legend.position="bottom")

pay_5_plot <- credit_card_training |> 
                ggplot(aes(x = PAY_5, fill = default.payment.next.month)) +
                geom_bar(stat = "count", position = "fill") +
                labs(x = "pay status (May 2005)", y = "") + 
                theme(legend.position="bottom")

pay_6_plot <- credit_card_training |> 
                ggplot(aes(x = PAY_6, fill = default.payment.next.month)) +
                geom_bar(stat = "count", position = "fill") +
                labs(x = "pay status (Apr 2005)") +
                theme(legend.position="bottom")

grid.arrange(pay_1_plot, pay_2_plot, pay_3_plot, pay_4_plot, pay_5_plot, pay_6_plot, nrow = 2, ncol = 3)
No description has been provided for this image

We see those who do not pay duly tend to default more than those who does, the difference is much quite significant so this might be an important factor in our model.

In [10]:
# Offset the x and y since you pay for the previous month
plot_1 <- ggplot(credit_card_training, aes(x = BILL_AMT2, y = PAY_AMT1, color = default.payment.next.month )) +
            geom_point(alpha=0.5) +
            labs(x = "Bill Amount (Aug 2005)",
                 y = "Pay Amount (Sep 2005)",
                 color="Default Payment Next Month (1=yes, 0=no)") + 
            theme(legend.position="bottom")

plot_2 <- ggplot(credit_card_training, aes(x = BILL_AMT3, y = PAY_AMT2, color = default.payment.next.month )) +
            geom_point(alpha=0.5) +
            labs(x = "Bill Amount (Jul 2005)",
                 y = "Pay Amount (Aug 2005)",
                 color="Default Payment Next Month (1=yes, 0=no)") + 
            theme(legend.position="bottom")

plot_3 <- ggplot(credit_card_training, aes(x = BILL_AMT4, y = PAY_AMT3, color = default.payment.next.month )) +
            geom_point(alpha=0.5) +
            labs(x = "Bill Amount (Jun 2005)",
                 y = "Pay Amount (July 2005)",
                 color="Default Payment Next Month (1=yes, 0=no)") + 
            theme(legend.position="bottom")

plot_4 <- ggplot(credit_card_training, aes(x = BILL_AMT5, y = PAY_AMT4, color = default.payment.next.month )) +
            geom_point(alpha=0.5) +
            labs(x = "Bill Amount (May 2005)",
                 y = "Pay Amount (Jun 2005)",
                 color="Default Payment Next Month (1=yes, 0=no)") + 
            theme(legend.position="bottom")

plot_5 <- ggplot(credit_card_training, aes(x = BILL_AMT6, y = PAY_AMT5, color = default.payment.next.month )) +
            geom_point(alpha=0.5) +
            labs(x = "Bill Amount (Apr 2005)",
                 y = "Pay Amount (May 2005)",
                 color="Default Payment Next Month (1=yes, 0=no)") + 
            theme(legend.position="bottom")

grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, nrow = 3, ncol = 2)
No description has been provided for this image

I notice that those who pay the full amount of the previous bill tends not to default (straight line) but those who don't pay tends to default (low pay amount). This would also be an important predictor in our model, however, we might need to mutate the data a bit to create something like "percent of payment" for the model to better understand that the two columns are closely related.

Summary¶

Although there is quite a bit of cleaning needed, the preliminary data analysis leads to promising results. Through data summaries, I believe education may be an important factor while sex and marriage not as much. And in terms of numerical data, we find whether or not a person defaults this month is a predictor for next month. Further, the relationship between pay amount and bill is also important in predicting next month's pay status. This model is important since we will be able to predict possible defaults, which is a factor in considering lending loans for the bank. However, from the data analysis, there are some questions regarding how best to measure the success of the model as well as how to combine the columns for a better indicator.

In [ ]: